Abstract. Conventional training methods for neural networks involve starting at a random location in the solution space of the network weights, navigating an error hypersurface to reach a minimum, and sometimes using stochastic techniques (e.g., genetic algorithms) to avoid entrapment in a local minimum. It is typically also necessary to preprocess the data (e.g., normalization) to keep the training algorithm on course. Conversely, Bayesian learning is an epistemological approach concerned with formally updating the plausibility of competing candidate hypotheses, thereby obtaining a posterior distribution for the network weights conditioned on the available data and a prior distribution. In this paper, we developed a powerful methodology for estimating the full residual uncertainty in network weights, and therefore network predictions, by using a modified Jeffreys' prior combined with a Metropolis Markov Chain Monte Carlo method.


1.0 INTRODUCTION


We propose a methodology for estimating
the full residual uncertainty in Artificial
Neural Network (ANN) weights, and
therefore network predictions, by using
Bayesian probability analysis [4] (BPA) and a
modified Jeffreys’ prior combined with
computational sampling methods including
Markov Chain Monte Carlo.

In this paper we restrict our attention to
three-layer feed-forward perceptrons, since
they are sufficient [1,2] to serve as universal
approximating functions. We further restrict
attention to supervised learning, and we do
not consider feature extraction.
As this effort is concerned with digital
simulation based approaches, we use
numerically driven discrete
formulations (i.e., sums instead of integrals)
throughout.

Artificial Neural Networks

An Artificial Neural Network can be thought
of as a computational model consisting
of three layers of processing units with full
interconnection between layers. Each
component of an input vector is scaled
individually for each middle layer unit; the
scaled components are then summed
and passed through a transfer or activation
function in each middle layer unit. The
middle layer outputs constitute another
vector that serves as the input to the output
layer, where it is likewise scaled independently
and individually for each output unit. The
units in the output layer are typically (but not
necessarily) simple linear functions. The
input vectors for each layer also contain an
implicit component of 1 to serve as an input
bias. Figure 1 depicts a conventional
three-layer feed-forward perceptron network.



Figure 1 - Three Layer Feed Forward Network 

Expressed as a mathematical model for the 
simplest case of a one input, one hidden 
unit, and one output we can write 

$$y = v_0 + v_1\,\psi(w_0 + w_1 x) \qquad (1)$$

where y is the output, w are the weights
(scale factors) from the inputs x (which
include an implicit bias input of 1) to the
hidden layer, v are the weights from the
transfer function outputs (which include an
implicit bias input of 1) to the output layer,
and $\psi$ is the non-linear activation function.
Note the location of the bias components
$v_0$ and $w_0$.

For arbitrary numbers of inputs, hidden 
units, and outputs, equation (1) takes the 
form 


$$y_k = v_{0k} + \sum_{h=1}^{N_h} v_{hk}\,\psi\!\left(w_{0h} + \sum_{i=1}^{N_i} w_{ih}\,x_i\right) \qquad (2)$$


and can be written as a matrix formulation. 
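As a minimal sketch of the three-layer forward pass described above, the following pure-Python function evaluates the network output for one input vector. The function name `forward`, the row layout of `W` and `V` (bias weight stored at index 0), and the choice of `tanh` as the activation are illustrative assumptions, not details taken from the paper.

```python
import math

def forward(x, W, V, activation=math.tanh):
    """Evaluate a three-layer feed-forward network for one input vector.

    x: list of N_i inputs (without the bias component).
    W: N_h rows, one per hidden unit; index 0 of each row is the bias weight.
    V: N_k rows, one per output unit; index 0 of each row is the bias weight.
    The tanh default is an assumption; any non-linear activation can be passed.
    """
    x_aug = [1.0] + list(x)                      # implicit bias input of 1
    hidden = [activation(sum(w * xi for w, xi in zip(row, x_aug))) for row in W]
    h_aug = [1.0] + hidden                       # implicit bias input of 1
    return [sum(v * h for v, h in zip(row, h_aug)) for row in V]

# One input, one hidden unit, one output: the simplest case of equation (1).
y = forward([0.5], W=[[0.1, 2.0]], V=[[0.3, -1.5]])
```

With one input, one hidden unit, and one output, the call reduces to the single-unit model of equation (1), with `W[0] = [w_0, w_1]` and `V[0] = [v_0, v_1]`.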


Bayesian Probability 

BPA requires determining a Universe of 
Discourse (UOD) which is a set of 
hypotheses that are ranked on a common 
scale of [0,1] in terms of their relative 
strength as an explanator of the observed 
data. This is done both for a family of 
competing models, and for competing sets 
of the parameter values for each model. 

The basic process is: 

• determine a prior distribution for the 
model parameters of a given model 

• determine a probabilistic likelihood 
function for the phenomena under study 

• determine a UOD for our analysis 

• determine a posterior distribution for the 
hypotheses in the UOD 

• make inferences from the posterior 
yielding full accounting for the residual 
uncertainty of the parameters 

This probability is then interpreted as a
measure or weighting of the amount of
inferential support [3] from the observed data
for the hypotheses, normed to entail the
chosen UOD.

We wish to stress that the choices of prior,
likelihood, and UOD represent degrees of
freedom for the researcher; BPA only
promises to give us the most logically
justifiable results contingent on these
choices [3,4].

Likelihood Model 

The likelihood function is entirely dependent 
on the phenomena under study and must be 
constructed to yield the conditional 
probability of any observed data for a 
chosen model and values of its parameters. 

Bayesian Prior Selection 

Bayesian prior selection is a vast subject. 
Typically one may express ignorance 
concerning the current problem, or may 
possess some information that may be 
codified into a prior, e.g. by the Principle of
Maximum Entropy [4,10,11]. In any case, the
prior expresses our starting information 
concerning the parameters of the likelihood 
function. 

Posterior Distribution 

Bayesian posterior determination proceeds
by computing Bayes Rule [4,13]

$$P(w \mid D, M) = \frac{P_0(w \mid M)\,L(D \mid w, M)}{\sum_{\{w\}} P_0(w \mid M)\,L(D \mid w, M)} \qquad (3)$$

where $P(w \mid D, M)$ is the posterior
probability of the model parameters $w$
conditioned on the observed data D and the
choice of model M, $P_0(w \mid M)$ denotes the
prior probability of the parameters $w$, which
summarizes all knowledge of $w$ for this
model brought forward into the present
analysis, and $L(D \mid w, M)$ is the likelihood
of the data being observed for the model
given that the parameters have the value $w$.
The denominator of (3) is known as the
evidence for the model:

$$P(D \mid M) = \sum_{\{w\}} P_0(w \mid M)\,L(D \mid w, M) \qquad (4)$$

and is the probability of the data given the
model, marginalized by the likelihood
function over the hypothesis space $\{w\}$.



Parameter Estimation 

Bayesian parameter estimation proceeds by 
evaluating and scoring on a common 
probability scale, each value of the 
parameters in the UOD of model 
parameters using (3), thereby producing a 
normalized probability distribution (or mass 
function) over the UOD. 
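The scoring procedure described above can be sketched for a finite UOD as follows. This is a toy illustration under stated assumptions: the function name `grid_posterior`, the one-parameter slope model, the unit-variance Gaussian likelihood, and the uniform prior mass are all hypothetical choices made to keep the example short. The computation is carried out in log space for numerical stability.

```python
import math

def grid_posterior(grid, log_prior, log_likelihood):
    """Return (posterior masses, log evidence) over a finite UOD `grid`."""
    log_joint = [log_prior(w) + log_likelihood(w) for w in grid]
    m = max(log_joint)                              # subtract max for stability
    unnorm = [math.exp(lj - m) for lj in log_joint]
    total = sum(unnorm)                             # scaled denominator of (3)
    posterior = [u / total for u in unnorm]
    log_evidence = m + math.log(total)              # log of the evidence (4)
    return posterior, log_evidence

# Assumed toy problem: infer a slope w in y = w * x with unit Gaussian noise.
xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
grid = [i * 0.1 for i in range(41)]                 # UOD: w in {0.0, ..., 4.0}
log_prior = lambda w: -math.log(len(grid))          # uniform prior mass (assumed)
log_lik = lambda w: sum(-0.5 * (y - w * x) ** 2 - 0.5 * math.log(2 * math.pi)
                        for x, y in zip(xs, ys))
posterior, log_evidence = grid_posterior(grid, log_prior, log_lik)
w_map = grid[posterior.index(max(posterior))]       # most probable grid point
```

The returned `log_evidence` is the quantity that would be compared across competing models, while `posterior` is the normalized mass function over the UOD.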

Model Selection

From (4) and using Bayes Rule (3) we can
write

$$P(M \mid D) = \frac{P_0(M)\,P(D \mid M)}{\sum_{\{M\}} P_0(M)\,P(D \mid M)} \qquad (5)$$

This posterior distribution over a separate
UOD {M} for models allows selection of the
model which best explains the observed
data, again also yielding a full
characterization of the residual uncertainty
conditioned upon a choice of prior for the
models and available data.

Occam Factors

BPA has an interesting feature where model
selection is concerned, in that it contains an
explicit built-in penalty for more complex
models over simpler models. This feature is
known as an Occam factor [11] and is a
consequence of forming the ratio of the
evidence for two competing models, one of
more complexity than the other, using
equation (4). A factor that emerges in the
ratio calculation will penalize [11] the more
complex model due to its greater expanse
of parameter space that will be ultimately
ruled out by conditioning on the available
data.

Bayesian Predictions

Bayesian inference or prediction is generally
concerned with the formal marginalization
over the hypothesis space $\{w\}$ of a given
model. For example, we might wish to
check the probability of some desired output
data $t$ conditioned on our model and our
training data such that

$$P(t \mid M, D) = \sum_{\{w\}} P(t, w \mid M, D) = \sum_{\{w\}} P(t \mid w, M, D)\,P(w \mid D, M) \qquad (6)$$

using the product rule of probability [4]. This is
the predictive distribution for observing the
data $t$, formally treating the model
parameters as nuisance parameters. For
example, if $P(t \mid M, D)$ is materially different
from the likelihood we might suspect our
choice of likelihood function.

We might also form a simple expectation
such that:

$$\langle y \rangle = \sum_{\{w\}} y(w)\,P(w \mid D, M) \qquad (7)$$

where $P(w \mid D, M)$ is given by (3), $y(w)$
could be the output of an ANN with
parameters $w$, and the expectation is
conditioned on the training data set D and
choice of network represented by M.
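The posterior-weighted expectation described above can be sketched over a small discrete UOD. The function names and the three-point example values below are hypothetical and serve only to illustrate the weighting; a variance is included as one simple summary of residual uncertainty in the prediction.

```python
def posterior_expectation(values, masses):
    """Sum over the UOD of y(w) weighted by the posterior mass P(w | D, M)."""
    total = sum(masses)                       # guards against unnormalized input
    return sum(v * p for v, p in zip(values, masses)) / total

def posterior_variance(values, masses):
    """Posterior-weighted spread of y(w) about its expectation."""
    mean = posterior_expectation(values, masses)
    return posterior_expectation([(v - mean) ** 2 for v in values], masses)

# Hypothetical three-point UOD: network outputs y(w) and their posterior masses.
outputs = [0.9, 1.0, 1.2]
masses = [0.2, 0.5, 0.3]
mean_y = posterior_expectation(outputs, masses)
var_y = posterior_variance(outputs, masses)
```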

Learning for Neural Networks

There remains the issue of determining the 
values of the weights w and v in (2) 
conditioned on the available data and any 
other relevant information. This is the 
central problem of ANN learning. We would 
like to point out that it is possible to 
determine the vector v as function of y and 
w with appropriate mathematical technique. 

Backpropagation 

The most common conventional (non-Bayesian)
approach to ANN learning is to
concern oneself with an error function such
as:

$$E = \frac{1}{N_p} \sum_{p=1}^{N_p} \sum_{k=1}^{N_k} \left(t_{pk} - y_{pk}\right)^2 \qquad (8)$$

As written, this is the mean squared error
per pattern for all outputs, where $N_k$
is the number of outputs, $t_{pk}$ denotes the
kth desired output for the pth input pattern, $y_{pk}$
denotes the kth observed output for the pth
input pattern, and $N_p$ denotes the total
number of observations or training data
patterns.

The basic stratagem of backpropagation is
to substitute (2) for the $y_{pk}$ in (8), and then
use error gradient information for the
weights in a numerical error
reduction algorithm (typically conjugate
gradient) to adjust the weights and achieve
some error minimum.
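For concreteness, the error function that backpropagation minimizes can be sketched as below. The sketch assumes the normalization is by the number of patterns only, with squared errors summed over the outputs of each pattern; the function name and the sample values are illustrative.

```python
def mean_squared_error(targets, outputs):
    """Squared error summed over outputs and averaged over training patterns.

    targets, outputs: lists of N_p patterns, each a list of N_k values.
    """
    n_patterns = len(targets)
    total = 0.0
    for t_p, y_p in zip(targets, outputs):
        total += sum((t - y) ** 2 for t, y in zip(t_p, y_p))
    return total / n_patterns

# Two patterns with two outputs each (hypothetical values).
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.8, 0.1], [0.3, 0.7]]
error = mean_squared_error(targets, outputs)
```

A gradient-based trainer would evaluate this quantity with the network outputs substituted for `outputs` and adjust the weights to reduce it.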

Two outstanding issues emerge in this 
approach to learning: 

• What is the proper minimum for the
residual error that the network
produces? This is the issue of
regularization, which is the inhibition of
network training on the noise component
of the signal.

• What network model best explains the 
observed data? 


Bayesian Learning for ANNs

In broad strokes, Bayesian Learning
requires choosing a prior distribution over
the network weights and framing a probabilistic
formulation for an ANN model or models,
then using (2), (3), and (4) to determine the
best network along with a posterior
probability distribution over the weights for
the selected model, with a full
characterization of the residual uncertainty
in both. We describe our solution to these
issues in section 2.

Monte Carlo Simulation 

Because the Bayesian posterior - which is 
the sought after distribution - is a priori 
unknown, we must resort to some form of 
search strategy to find it. Monte Carlo 
simulations were originally developed to 
provide numerical integration of functions 
but can be used in a variety of ways to 
sample the solution space and determine 
the probability distribution for our chosen 
UOD, which we are choosing
probabilistically as a Markov Chain rather
than deterministically.


2.0 METHODOLOGY

The principal challenge of our methodology
is to combine Bayesian Probability,
mathematical models of ANNs, and
simulation based methods of solution
search to determine a joint posterior
probability distribution for the hidden
network weights and any other parameters
such as the noise or stochastic contribution
to the observed data. This provides all
that is necessary to make predictions with
full accounting of the residual uncertainty in
the inferred network.

Probabilistic Likelihood Model 

We must determine a likelihood model to be
used in equations (3) and (4).

To address the issues of regularization
(inhibition of overfitting) and to account for
actual residual stochasticity in the data, we
choose to compose our “meta-model” as a
linear combination of deterministic and
non-deterministic (stochastic) components.

This requires expanding our hypothesis
space to also ascertain the correct amount
of stochasticity (loosely, noise) in the input
data. To clarify, we seek to model the
residual stochastic (loosely, noise)
component of the input data or signal as a
form of regularization. To that end we model
the likelihood of the target data (training
pattern) less the output y of the candidate
network as:

$$L(t - y \mid w, \Sigma) = G(t - y(w), \Sigma) \qquad (9)$$

where G(.) is a Multivariate Gaussian
probability density, $t$ is the training pattern
output data, and the components of $y(w)$
are given by (2). That is to say, our
likelihood model for the difference between
the target (training) data and the candidate
network output is to be modeled as a form
of Multivariate Gaussian white noise. Note
that since white noise is uncorrelated, our
likelihood model is conditionally
independent between training inputs; we
therefore model the stochastic part of the
output as conditionally independent
between training patterns while allowing for
the possibility of correlations between the
components of the output vector. We
therefore write for the likelihood of the
training set for a given choice of model
parameters $\{w, \Sigma\}$:

$$L(t - y \mid w, \Sigma) = \prod_{p=1}^{N_p} G(t_p - y_p, \Sigma) \qquad (10)$$

Note that we have expanded our parameter
search according to Bayesian principles to
include the multivariate noise contributions
$\Sigma$, the full covariance matrix for the difference
between the observed output and modeled
output. The full covariance matrix provides
for possible correlations among the
elements of the vector $t - y$, which is the
modeled stochastic component associated
with each training pattern vector. We
use equation (10) for the likelihood function
$L(\cdot)$ in (3), where the data D is now
understood to be the stochastic part of the
training data, i.e., $t - y(w)$. Thus we are
absorbing the deterministic part of the signal
into the network output in such a manner as
to maximize the probability associated with
the stochastic residue of the training input
via our likelihood function.
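A minimal sketch of this training-set likelihood, evaluated in log form, is given below. It assumes a zero-mean Gaussian on the residual between target and network output, with one full covariance matrix shared by all patterns. The helper names and the pure-Python Cholesky factorization are illustrative choices, not the authors' implementation.

```python
import math

def cholesky(S):
    """Lower-triangular L with L L^T = S, for symmetric positive definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_gaussian(r, L):
    """Log density of a zero-mean Gaussian at residual r, covariance L L^T."""
    n = len(r)
    z = []
    for i in range(n):                    # forward substitution: solve L z = r
        z.append((r[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2 * math.pi) + log_det + sum(v * v for v in z))

def log_likelihood(targets, outputs, Sigma):
    """Sum over patterns of the log Gaussian density of the residual t_p - y_p."""
    L = cholesky(Sigma)
    return sum(log_gaussian([t - y for t, y in zip(t_p, y_p)], L)
               for t_p, y_p in zip(targets, outputs))

# Hypothetical two-output example with correlated noise components.
ll = log_likelihood([[1.0, 1.0]], [[0.0, 0.0]], [[2.0, 1.0], [1.0, 2.0]])
```

Because the patterns are treated as conditionally independent, the product over patterns becomes a sum of log densities, which avoids numerical underflow for large training sets.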

Using ANNs in this fashion can be thought
of as a form of Bayesian non-parametric
probabilistic modeling with the choice of
activation function serving as the
appropriate basis functions [10]. (N.B.: the
term “Non-Parametric Bayesian” has
acquired a different meaning in the literature
than what we are implying in this study.)


Choice of Prior Distribution 

Our methodology addresses the choice of a
prior distribution by choosing a (modified)
Jeffreys’ prior [5,6,11] to express partial
ignorance over the parameter space, but
also because it discriminates against
excessively large model parameter values [12]
(called shrinkage in the statistics literature,
and weight decay in the ANN literature).

Jeffreys’ prior is invariant to transformations 
of the parameter space and is related to the 
expected value of the Fisher Information 
Matrix. For scale parameters, this becomes 

$$P(w \mid M) = c_0 / w \qquad (11)$$

where $c_0$ is a normalizing constant. This
represents a density which is apportioned 
equally per decade of its scale and is 
therefore scale invariant. While the 
continuous version of this density is strictly 
improper (the cumulative distribution 
integral diverges), it is straightforward to 
construct a normalized discrete probability 
mass function over some chosen (always 
finite) UOD. 

We consider the sought after network
weights $w$ and the noise contribution $\Sigma$ to
both be scale parameters. We modify the
Jeffreys’ priors for both according to the
following considerations:

• We impose a minimum value for each
component of $w$, and the diagonal
components of $\Sigma$, such that any values
less than these cutoffs decay smoothly
to zero

• We normalize the resulting discrete
distributions over some reasonable
range

• Since weight parameters $w$ may be
negative, we actually use the absolute
value in (11), keeping the distribution in
that case symmetric about the origin.

The resulting distributions have the general 
form of Figure 2 below. 
Figure 2 - Modified Jeffreys’ prior density


For the diagonal components of the noise
contribution parameter $\Sigma$ we have only the
positive (properly normalized) part of the
curve. We do not discriminate against small
values of the off-diagonal (covariance)
elements of $\Sigma$.

Strictly speaking, the cutoffs are somewhat
arbitrary and good candidates for a
hyperparameterized prior for each (in
contemporary Bayesian fashion), but that is
not included in this analysis. The cutoff
values actually used were chosen to be
sufficiently small that they should suit a
wide range of problems.

These minimum values are thought reasonable on
the basis that network weights which are too
small lead to uninteresting solutions, and if
the noise contribution is too small then we
are in effect eliminating that component of
the modeling. In both cases, parameter
values of 0 are clearly uninteresting.
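The exact functional form of the smooth cutoff is not given in the text, so the following is only a sketch under an assumed form: a density proportional to |w| / (w² + c²), which behaves like 1/|w| well above the cutoff c, decays smoothly to zero at the origin, and is symmetric about it. The function name, grid, and cutoff value are hypothetical.

```python
def modified_jeffreys(grid, cutoff):
    """Normalized discrete prior over `grid`.

    Assumed form: |w| / (w^2 + cutoff^2), i.e. roughly 1/|w| for |w| much
    larger than `cutoff` and smoothly approaching 0 as w approaches 0.
    """
    unnorm = [abs(w) / (w * w + cutoff * cutoff) for w in grid]
    total = sum(unnorm)
    return [u / total for u in unnorm]

grid = [i * 0.01 for i in range(-500, 501)]      # symmetric finite UOD on [-5, 5]
prior = modified_jeffreys(grid, cutoff=0.1)
peak = abs(grid[prior.index(max(prior))])        # location of the prior's peak
```

For a noise variance, which must be positive, only the positive half of the grid would be used and renormalized.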

MCMC 

Choosing a UOD by Monte Carlo (MC)
simulation tends to take one of two basic
tracks:

1. Start at a random location and use a
local rule to choose the next location

2. Use a global rule to choose all the
locations

Conventional MCMC sampling techniques
such as Metropolis, Metropolis-Hastings,
and Gibbs sampling are basically
of type 1.

Independence Chain sampling and
Importance sampling are of type 2.

In our version of grid or mesh sampling we
used a coarse grid to fine grid progression
to characterize the posterior distribution and
locate promising regions which were
subsequently explored with a finer grid. This
approach suffers from exponential increase
in computational effort with increased
dimensionality of the parameter space. The
basic approach is to compute equation (3)
for each grid point in the UOD, thus
achieving a coarse grained posterior
probability distribution. Finer grained
computations over more promising regions
then ensued. It is in this sense that a
non-local or global rule is used in choosing
sample points.

In a random walk oriented MCMC approach
we used a Metropolis algorithm, which
generally is able to find promising regions
and is less computationally demanding, but
often may not give a complete
characterization of the posterior distribution
and can yield lower quality network output
when compared against the training data
than grid sampling. The Metropolis
algorithm operates by choosing its next
point by constructing a Markov Chain via
sampling from a local proposal density
which is centered on the current point and
for our study is a Multivariate Gaussian of
the same dimension as the hypothesis
space. The new point is then accepted with
probability

$$P = \min\left\{1, \frac{P(w' \mid D, M)}{P(w_n \mid D, M)}\right\} \qquad (12)$$

where $w'$ is the proposed point and $w_n$ is
the current point.

A chain of N points $\{w_n\}$ is thus determined
from this algorithm such that if a new
candidate point is rejected, a copy of the
current point is added to the chain. In this
fashion, points are accumulated according
to their relative probability, the duplication of
points serving to increase their respective
weighting for inferences from the chain such
as expectations according to

$$\langle y \rangle = \frac{1}{N} \sum_{n=1}^{N} y(w_n) \qquad (13)$$

where $y(w)$ is given from (2), and the
inferential locus is the chain $\{w\}$.

Thus the UOD in our methodology is
determined stochastically by the sampling
algorithm. Each sampled point, accepted or
rejected, may be retained so as to provide a
proper discrete probability distribution of the
form

$$PD = \{\{w_1, P_1\}, \ldots, \{w_K, P_K\}\} \qquad (14)$$

for a distribution with K elements to be used
in equations (6) and (7). Conventional
MCMC doctrine uses (13). The proposal
density used in (12) must be tuned in order
to achieve acceptance rates of between
25% and 50% as recommended by the
literature [14]. Typically MCMC sampling
includes a warm up time to allow the chain
to begin properly representing the target
posterior distribution, and so the warm up
time is not included in equation (13).
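The random-walk Metropolis procedure described in this section can be sketched as follows. This is a generic textbook version under stated assumptions, not the authors' code: the proposal is an isotropic Gaussian with a fixed, untuned step size, the target is supplied as a log posterior, and the one-dimensional standard normal target in the usage lines is a hypothetical stand-in for a network posterior.

```python
import math
import random

def metropolis(log_post, start, step, n_samples, warm_up, rng):
    """Random-walk Metropolis with an isotropic Gaussian proposal.

    Returns the retained chain (warm-up discarded) and the acceptance fraction.
    """
    current, current_lp = list(start), log_post(start)
    chain, accepted = [], 0
    for n in range(warm_up + n_samples):
        proposal = [c + rng.gauss(0.0, step) for c in current]
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, P(proposal) / P(current)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        if n >= warm_up:
            chain.append(list(current))   # a rejection repeats the current point
    return chain, accepted / (warm_up + n_samples)

# Assumed toy target: a one-dimensional standard normal log density.
rng = random.Random(0)
chain, rate = metropolis(lambda w: -0.5 * w[0] ** 2, [3.0], 2.0, 5000, 500, rng)
mean_w = sum(w[0] for w in chain) / len(chain)    # chain average of w
```

The returned acceptance fraction is the quantity one would monitor when tuning the proposal width, and averaging any function of the retained points gives the chain-based expectation.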


3.0 EXPERIMENTS

General Remarks 

We performed experiments on both
simulation generated data and real world
data. Simulation generated data is
especially beneficial for validation of the
methodology since the input data
uncontaminated by noise is readily
available. Naturally, performing well in such
a case gives one confidence in attacking
real-world problems where the noise
component may be unknown.


Noisy Sine Wave 

This experiment is for a noisy sine wave
with Gaussian noise. It was pattern matched
with a network consisting of 1 input, 2 hidden
units, and 1 output, and randomly
sampled for 1000 trials.



Figure 3 - Noisy Sine Wave 

In Figure 3 the red sine wave is the true 
denoised signal, the noisy red line is the 
actual input, and the blue line is the 
prediction of the network. 



Figure 4 - Noise only (Sine Wave) 

In Figure 4 the red curve is the true noise in
the signal, and the blue line is the actual
input less the prediction of the network, i.e.,
the modeled noise resulting from the
network prediction.

Decaying Exponential 

This experiment is for a noisy decaying 
exponential curve and was pattern matched 
with a network consisting of 1 input, 1 
hidden unit, 1 output, with Gaussian noise. 
Figure 5 - Noisy Decaying Exponential

In Figure 5 the red line is the true denoised 
signal, the noisy red line is the actual input, 
and the blue line is the prediction of the 
network. 





Figure 6 - Posterior for the Two Weights for 
Decaying Exponential 

In Figure 6 we have the posterior 
distribution for the two hidden weights for 
the decaying exponential problem 
determined by grid sampling. There are 
clearly two branches of significant 
probability above some floor. 





Figure 7 - Metropolis Markov Chain Samples 
for Two Weights for Decaying Exponential 

In Figure 7 we have the sample points from 
the Metropolis algorithm for the two hidden 
weights for the decaying exponential 
problem. Note the correspondence 
between Figures 6 and 7. The Metropolis 
algorithm has found one branch of the 
solution depicted in Figure 6. 



Figure 8 - Metropolis Markov Chain Process 
for Two Weights in Decaying Exponential 

In Figure 8 we show the processes for the
sample for the two weights of the decaying
exponential fit. Note that the process for one
of the parameters has found the correct
value after a warm up of approximately
1500 steps in the Markov Chain. The other
parameter’s process is more of a random
walk due to its wide range of acceptable
values (compare with Figures 6 and 7).
Figure 9 - Metropolis Markov Chain Process
for Noise Parameter

In Figure 9 we show the process for the
sample for the noise component parameter
of the decaying exponential fit. We note
that the process for the noise parameter has
found the correct value of $\mathcal{N}(0, 0.05^2)$ after
a warm up of approximately 1500 steps in
the Markov Chain.



Figure 10 - Uncertainty in Network 
Prediction 

In Figure 10 we show the normalized
probability distribution from the resultant
Markov Chain for the first output point. The
denoised output for the first point is actually
1.


Concrete Problem

This problem, known as the Concrete
Problem, is from the Machine Learning
Database at UC Irvine and consists of 7
inputs and 3 outputs. It was processed with
a network of 32 hidden units and only 500
MCMC samples after a warm up of 500
samples. Also included in the sampling for
this problem were all the components of the
full covariance matrix for the outputs,
including the off-diagonal components (ref.
equations 9, 10).

Figure 11 - Comparison between network
output (blue) and training input (red) for 3
outputs in Concrete problem

In Figure 11 we compare the training data
(red) with the network prediction (blue) for
the concrete problem.

Figure 12 - Typical Metropolis Samples for
Two of the Weights in Concrete problem

4.0 DISCUSSION / CONCLUSIONS

The selected experiments showed good
responses of the methodology for
surprisingly few trials. In cases
where the actual strength of the noise
component of the signal was known, the
method reliably inferred a value very close
to the actual value, with small variance on
repeated trials. In all experiments we note
the fairly rapid convergence of the chain to
promising potential solutions starting from
randomized initial locations. The
implemented MCMC sampling algorithm
was admittedly crude and typically achieved
acceptance fractions of between 10% and
15%, well below that recommended by the
literature (25% to 50%).

Introduction


• I am a Modeling and Simulation Ph.D. student at
ODU and an employee of Alion Science and
Technology currently researching issues in
computational Bayesian probability analysis in the
context of problems in Machine Learning.

• My research thrust consists of a combination of

- Bayesian Model Testing

- Bayesian Parameter Estimation

- Adaptive Monte Carlo sampling

- Information Theoretic Probabilistic analysis

- ANN Constraint Analysis

- Probabilistic Inference

Outline


• ANNs
• BPA
• ANN (Machine) Learning
• MC simulation
• Markov Chains
• MCMC
• Methodology

- Likelihood Modeling

- Prior Distribution for Weights

- MCMC Sampling Technique

- MCMC Proposal Distribution

• Experiments

- Exponential Curve

- Sine curve

- Concrete Problem

• Discussion/Conclusions


Neural Networks

• Regression - Determines a regression curve to sample data

• Classification - Maps a non-linear decision boundary to a
linear decision boundary in the feature space of the
non-linear basis functions (e.g. sigmoids)

Supervised Learning Problem



Problem! Not LMSE -> Regularization: Theoretical basis? 





Bayesian Statistics in a Nutshell 




Bayes Rule: Natural Learning Law from Data 


Prior Distribution


Likelihood


Posterior/Conditional Distribution


Marginal Data Probability or Evidence


Universe of Discourse



Bayesian Marginalization

Bayesian prediction/inference is usually difficult
and/or expensive -> requires Marginalization

• Conditional Inference (Distribution)

$$P(w_i \mid D) = \sum_{\{w_{j \neq i}\}} P(w_1, w_2, \ldots, w_n \mid D)$$

• Predictive Distribution

$$P(d \mid D, M) = \sum_{\{w\}} P(d, w \mid D, M) = \sum_{\{w\}} P(w \mid D, M)\,L(d \mid w, M)$$

Posterior/Conditional Distribution: $P(w \mid D, M)$

Likelihood: $L(d \mid w, M)$


Bayesian Probability: Researcher
Degrees of Freedom

• Initial Conditions: Prior Distributions
• Phenomena: Likelihood function
• Purpose: Inferences / Predictions

Discrete Markov Chains


A sampling sequence {x} from a proposal
distribution forms a Markov chain if it has the
property:

$$p_p(x_n \mid x_{n-1}, \ldots, x_1) = p_p(x_n \mid x_{n-1})$$

Markov Chain Monte Carlo is a chain construction
technique for converging a chain to some target
distribution $p_t$ which is unknown in advance, s.t.:

$$p_t(x) = \lim_{n \to \infty} p(x_n), \qquad x_n \leftarrow p_p(x_n \mid x_{n-1})$$

• Thus

$$\langle f(x) \rangle = \lim_{N \to \infty} \sum_{n=1}^{N} p_t(x_n)\,f(x_n) \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$

Metropolis Sampling


Constructs a Markov chain by proposing the next
point in the chain s.t.:


Accepting the proposed next point with
probability:

$$P = \min\left\{1, \frac{p(w' \mid D, M)}{p(w_n \mid D, M)}\right\}$$

• Where

$$p(w \mid D, M) = c\,L(D \mid w, M)\,\pi(w \mid M)$$

Methodology


• Likelihood function

• Prior density for network weights

• MC Sampling Technique

• Predictions / Inferences


Neural Networks
Likelihood Function

• Likelihood of training set for model choice:

$$L(t - y \mid w, \Sigma) = \mathcal{N}(t - y(w), \Sigma)$$

Modified Jeffreys’ Prior

• Jeffreys’: Concerned with specifying transformation
invariant priors to represent ignorance.

• Must address both location and scale parameters.

• Must define ignorance by a specific transformation (Jaynes).

• Consider

$$\mu' = \mu + a; \quad \sigma' = b\,\sigma; \quad f(\mu, \sigma) \equiv g(\mu', \sigma')$$

• General Solution is

$$P(\mu, \sigma) = \text{const} \times \frac{1}{\sigma}$$

• ANN; Noise -> scale parameters

• Modify to discriminate against near zero values:

MC Sampling Technique 


memesd,;) 


Initial Proposal Distribution 
Scaling Rules 

if( acceptance fraction < 25%) -> increase memesd 

if( acceptance fraction > 50%) -> decrease memesd 

Inferences / Predictions


Expectations

$$\langle f(x) \rangle = \lim_{N \to \infty} \sum_{n=1}^{N} p_t(x_n)\,f_n(x_n) \approx \frac{1}{N} \sum_{n=1}^{N} f_n(x_n)$$

Posterior Distributions:

$$\text{histogram}(\{f_n(w_n)\})$$

E.g.,

Network Output
Weights

Noise Parameter

Vector Noise Covariance Matrix

Noisy Exponential Curve

Simple network of 1 input, 1 hidden unit, 1 output
Solution state space is two weights
Actual data is

$$y = \exp\left(-\frac{x}{10}\right)$$

Noisy Exponential Curve

• Metropolis sampling with proposal =

Noisy Sine Curve

Network of 1 input, 2 hidden units, 1 output
Solution state space is four weights
Actual data is


i> - ^ + 0 4sin(2jzv) + ,V(0,0.2 : ) = v (l + v,|/(w a + w„_v) + v ; ^(h' 4 + w^x 



Noisy Sine Curve


• Metropolis sampling with proposal =

Concrete Problem

• Network of 7 inputs, 64 hidden units, 3 outputs

• Solution state space is (7+1)64 = 512 weights

• Actual data is real-world -> “noise” component
unknown


Concrete Problem

• Comparative Predictions of 8, 16, 32, 64 hidden units



Discussion / Conclusions

• A promising start!

• Surprisingly good results for small sample sizes

• Other options for priors:

- Constraints

- Non Linear solvers

• Proposal Density Scaling Issues

• Regularization: Model Selection appears best.